Language identification of code switching sentences and multilingual sentences of under-resourced languages by using multi structural word information

Authors

  • Yin-Lai Yeong
  • Tien Ping Tan
Abstract

Language identification (LID) is the process of identifying the languages used in a text or speech. Code switching is the switching from one language to another within a sentence or speech utterance. This paper focuses on LID of words in code switching sentences. Code switching can occur intersententially or intrasententially. A writer may switch from one language to another for various reasons, among them the inability to express an opinion in a particular target language, the wish to attract attention or to address a different audience, habitual expressions and so on. Identifying the language of each word in a code switching sentence is difficult because the languages share the same character set. In addition, the switched segment can be as short as a word or as long as a sentence. In this paper, we propose an automatic LID for words in code switching sentences that uses multi structural word information (MUSWI), such as grapheme, syllable and word structure, scored with an n-gram statistical model. The proposed MUSWI approach achieves an accuracy of 96.36% on the code switching sentences and 99.07% on the multilingual (non-code switching) sentences, which are in under-resourced and closely related languages.

Index Terms: language identification, code switching, n-gram, multilingual, under-resourced languages, closely related languages
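The word-level n-gram scoring described in the abstract can be illustrated with a minimal sketch. It covers only the grapheme level (character bigrams with add-one smoothing) and is not the authors' MUSWI implementation, which also draws on syllable and word structure. The names `char_ngrams`, `CharNgramModel` and `identify_words`, the smoothing choice, the language labels and the toy word lists (English and Malay) are all assumptions made for illustration; the abstract does not name the languages or the smoothing method.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = "^" * (n - 1) + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Character n-gram model for one language (add-one smoothing is an assumption)."""

    def __init__(self, words, n=2):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        for word in words:
            for gram in char_ngrams(word, n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
        # Distinct final symbols seen in training, plus one slot for unseen symbols.
        self.vocab_size = len({gram[-1] for gram in self.ngram_counts}) + 1

    def log_prob(self, word):
        """Smoothed log-probability of the word's character sequence."""
        total = 0.0
        for gram in char_ngrams(word, self.n):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def identify_words(sentence, models):
    """Label each whitespace-separated word with the best-scoring language."""
    return [
        (word, max(models, key=lambda lang: models[lang].log_prob(word)))
        for word in sentence.split()
    ]


# Hypothetical toy training data; a real system would train on large word lists.
TOY_WORDS = {
    "en": ["the", "this", "that", "think", "thing",
           "school", "street", "right", "night", "with"],
    "ms": ["saya", "makan", "minum", "sekolah", "rumah",
           "jalan", "kereta", "pergi", "akan", "dengan"],
}
models = {lang: CharNgramModel(words) for lang, words in TOY_WORDS.items()}

for word, lang in identify_words("saya think makan night", models):
    print(word, lang)
```

Because both toy languages use the same Latin character set, the decision rests entirely on which character sequences are more probable in each language, which is the difficulty the abstract points out. With training data this small the labels are only indicative.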


Similar resources

An Investigation into the Effective Factors in Comprehending English Garden-Path Sentences by EFL Learners

The present study aimed at highlighting the possible effects of age, proficiency level, and the structural composition of Garden-Path (GP) sentences on EFL learners' comprehension. 80 Iranian EFL learners were recruited from the initial pool of 114 participants based on the results of an English proficiency test; 40 advanced, and 40 intermediate learners were selected. Moreover, two age...


Motivational Determinants of Code-Switching in Iranian EFL Classrooms

“Code-Switching”, an important issue in the field of both language classroom and sociolinguistics, has been under consideration in investigations related to bilingual and multilingual societies. First proposed by Haugen (1956) and later developed by Grosjean (1982), the term code-switching refers to language alternation during communication. Although code-switching is unavoidable in bilingual and...


Word-Forming Process in Azeri Turkish Language

This study examines the general methods of natural word formation in the Azeri Turkish language by analyzing the construction of compound Azeri Turkish words. Same’ei (2016) conducted a comprehensive study of the word-forming process in Farsi, which inspired this study of word formation in Azeri Turkish. Numerous scholars had done vari...


A Language Modeling Approach to Identifying Code-Switched Sentences and Words

Globalization and multilingualism contribute to code-switching – the phenomenon in which speakers produce utterances containing words or expressions from a second language. Processing code-switched sentences is a significant challenge for multilingual intelligent systems. This study proposes a language modeling approach to the problem of codeswitching language processing, dividing the problem i...


Huge Automatically Extracted Training Sets for Multilingual Word Sense Disambiguation

We release to the community six large-scale sense-annotated datasets in multiple languages to pave the way for supervised multilingual Word Sense Disambiguation. Our datasets cover all the nouns in the English WordNet and their translations in other languages for a total of millions of sense-tagged sentences. Experiments prove that these corpora can be effectively used as training sets for supe...




Publication date: 2014